Introduction

The goal of the midterm is to apply some of the methods for supervised and unsupervised analysis to a new dataset. We will work with data characterizing the relationship between wine quality and its analytical characteristics, available at the UCI ML repository as well as on this course website on Canvas. The overall goal will be to use data modeling approaches to understand which wine properties most influence wine quality as determined by expert evaluation. The output variable assigns wine to discrete categories between 0 (the worst) and 10 (the best), so this problem can be formulated as classification or regression – here we will stick to the latter and treat/model the outcome as a continuous variable. For more details please see the dataset description available at UCI ML or the corresponding file on the course website on Canvas. Please note that there is another, much smaller, dataset on UCI ML that also characterizes wine in terms of its analytical properties – make sure to use the correct URL as shown above, or, to eliminate any possibility of ambiguity, the data available on the course website on Canvas – the correct dataset contains several thousand observations. For simplicity, clarity, and to decrease your dependency on network reliability and UCI ML availability, you are advised to download the data made available on the course website to a local folder and work with that local copy.

There are two compilations of data available at the URL shown above as well as on the course website on Canvas – one for red and one for white wine. Please develop models of wine quality for each of them, investigate the attributes deemed important for wine quality in both, and determine whether the quality of red and white wine is influenced predominantly by the same or by different analytical properties (i.e. predictors in these datasets). Lastly, as an exercise in unsupervised learning, you will be asked to combine the analytical data for red and white wine and describe the structure of the resulting data – whether there are any well-defined clusters, what subsets of observations they appear to represent, which attributes seem to affect this structure the most, etc.
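The red/white combination step mentioned above can be sketched as follows; the toy data frames below stand in for the two files, and `wtype` is a made-up label column that later lets cluster membership be compared against wine type:

```r
# Toy stand-ins for the data frames read from
# winequality-red.csv / winequality-white.csv:
red   <- data.frame(alcohol = c(9.4, 9.8), quality = c(5, 5))
white <- data.frame(alcohol = c(8.8, 9.5), quality = c(6, 6))

# Add a wine-type label, then stack the two frames row-wise;
# the label is what lets cluster structure be compared to wine type later
both <- rbind(cbind(red,   wtype = "red"),
              cbind(white, wtype = "white"))
table(both$wtype)
```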

Finally, as you will notice, the instructions here are terser than in the previous homework assignments. We expect you to use what you’ve learned in class to complete the analysis and draw appropriate conclusions from the data. All approaches that you are expected to apply here have been exercised in the preceding weekly assignments – please feel free to consult your submissions and/or the official solutions for how they were applied to different datasets. As always, if something appears unclear, please ask questions – we may switch to private mode those that, in our opinion, reveal too many details.

Sub-problem 1: load and summarize the data

Download and read in the data; produce numerical and graphical summaries of the dataset attributes; decide whether they can be used for modeling in untransformed form or whether any transformations are justified; comment on the correlation structure and on whether some of the predictors suggest a relationship with the outcome.

Red Wine:

#Load red.wine data
red.wine = read.csv(file="winequality-red.csv",header=TRUE,sep=";")
#Dimensions of data
dim(red.wine)
## [1] 1599   12
#List first 6 rows with headers
head(red.wine)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
#Summarizing data
summary(red.wine)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
#Boxplotting each attribute
old.par <- par(mar=c(1,1,1,1),mfrow=c(1,12),ps=6)
for(i in colnames(red.wine)){
boxplot(red.wine[,i],main=i)
}

par(old.par)

#Pair/Plot all 12 attributes of untransformed data
pairs(red.wine,pch=19,cex=red.wine$quality^3/1000, col=hsv(0, 1, 1-red.wine$quality/10))

#Pair/Plot all 12 attributes of log-transformed data
pairs(red.wine+1,pch=19,cex=red.wine$quality^3/1000,log="xy", col=hsv(0, 1, 1-red.wine$quality/10))

#Outlier detection
#plot(red.wine[,c(7,1)])
#identify(red.wine[,c(7,1)])

#old.par<-par(mfrow=c(2,2))
#plot(lm(red.wine$quality~red.wine$fixed.acidity+red.wine$total.sulfur.dioxide, red.wine))

#Rows 1080 and 1082 are outliers, as can be seen in the plots of "total.sulfur.dioxide", but they do not influence the model red.wine$quality~red.wine[,1]+red.wine[,7] significantly
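The claim that these points do not influence the fit can be checked quantitatively with Cook's distance; a minimal self-contained sketch on made-up data (`x` and `y` are toy variables, not the wine attributes; in the report this would be applied to the `lm` fit above):

```r
# Influence of a suspected outlier can be quantified with Cook's distance
set.seed(7)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
x[50] <- 6; y[50] <- -10          # plant one clearly influential point
cd <- cooks.distance(lm(y ~ x))
which.max(cd)                     # the planted observation dominates;
                                  # cd > 4/n is a common rule-of-thumb cutoff
```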


#Checking for zeros or NAs in the data
#length(red.wine[red.wine == 0])
#length(which(is.na(red.wine)))

#Check if log-transformation improves the data
old.par <- par(mar=c(1,1,1,1), mfrow=c(6,4),ps=6)
for ( col.n in colnames(red.wine)[1:6] ) {
  
  hist(red.wine[,col.n],col='brown1',xlab=col.n, main=col.n)
  qqnorm(red.wine[,col.n])
  qqline(red.wine[,col.n], col="red")
  hist(log(red.wine[,col.n]+1),col='lightblue',xlab=col.n, main=col.n)
  qqnorm(log(red.wine[,col.n]+1))
  qqline(log(red.wine[,col.n]+1), col="red")
}

par(old.par)
#Next 6 attributes
old.par <- par(mar=c(1,1,1,1), mfrow=c(6,4),ps=6)
for ( col.n in colnames(red.wine)[7:12] ) {
  
  hist(red.wine[,col.n],col='brown1',xlab=col.n, main=col.n)
  qqnorm(red.wine[,col.n])
  qqline(red.wine[,col.n], col="red")
  hist(log(red.wine[,col.n]+1),col='lightblue',xlab=col.n, main=col.n)
  qqnorm(log(red.wine[,col.n]+1))
  qqline(log(red.wine[,col.n]+1), col="red")
}

par(old.par)

Log-transformation improves all data attributes except “citric.acid” and “quality”, so it is justified for the other attributes. For the “density” and “pH” attributes the log-transformation has no significant effect.
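The visual impression from the histograms and QQ-plots can be backed by a numeric skewness check; a minimal sketch on a made-up right-skewed vector (the `skewness` helper is a simple moment estimator written inline to avoid an extra package; in the report it would be applied to columns of `red.wine`):

```r
set.seed(1)
# Sample skewness: third standardized central moment
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# A right-skewed toy vector stands in for a column such as
# red.wine$residual.sugar; log(x+1) pulls in the long right tail
x <- rexp(1000)
round(c(raw = skewness(x), logged = skewness(log(x + 1))), 2)
```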

#Log-transforming all except "citric.acid" and "quality"
red.wine.log = red.wine
for(cn in colnames(red.wine.log[,!names(red.wine.log) %in% c("citric.acid","quality")])){
  red.wine.log[,cn] = log(red.wine[,cn]+1)
  #colnames(red.wine.log)[names(red.wine.log) == cn] = paste0("L.",cn)
}
#Pair/Plot all 12 attributes of log-transformed data
#pairs(red.wine.log[,-grep("quality",colnames(red.wine.log))],cex=.2,pch=19,col=red.wine.log$quality*3)

#Correlation matrix
red.wine.cr=cor(red.wine,use="complete") 
signif(red.wine.cr,3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity               1.0000         -0.25600      0.6720
## volatile.acidity           -0.2560          1.00000     -0.5520
## citric.acid                 0.6720         -0.55200      1.0000
## residual.sugar              0.1150          0.00192      0.1440
## chlorides                   0.0937          0.06130      0.2040
## free.sulfur.dioxide        -0.1540         -0.01050     -0.0610
## total.sulfur.dioxide       -0.1130          0.07650      0.0355
## density                     0.6680          0.02200      0.3650
## pH                         -0.6830          0.23500     -0.5420
## sulphates                   0.1830         -0.26100      0.3130
## alcohol                    -0.0617         -0.20200      0.1100
## quality                     0.1240         -0.39100      0.2260
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity               0.11500   0.09370            -0.15400
## volatile.acidity            0.00192   0.06130            -0.01050
## citric.acid                 0.14400   0.20400            -0.06100
## residual.sugar              1.00000   0.05560             0.18700
## chlorides                   0.05560   1.00000             0.00556
## free.sulfur.dioxide         0.18700   0.00556             1.00000
## total.sulfur.dioxide        0.20300   0.04740             0.66800
## density                     0.35500   0.20100            -0.02190
## pH                         -0.08570  -0.26500             0.07040
## sulphates                   0.00553   0.37100             0.05170
## alcohol                     0.04210  -0.22100            -0.06940
## quality                     0.01370  -0.12900            -0.05070
##                      total.sulfur.dioxide density      pH sulphates
## fixed.acidity                     -0.1130  0.6680 -0.6830   0.18300
## volatile.acidity                   0.0765  0.0220  0.2350  -0.26100
## citric.acid                        0.0355  0.3650 -0.5420   0.31300
## residual.sugar                     0.2030  0.3550 -0.0857   0.00553
## chlorides                          0.0474  0.2010 -0.2650   0.37100
## free.sulfur.dioxide                0.6680 -0.0219  0.0704   0.05170
## total.sulfur.dioxide               1.0000  0.0713 -0.0665   0.04290
## density                            0.0713  1.0000 -0.3420   0.14900
## pH                                -0.0665 -0.3420  1.0000  -0.19700
## sulphates                          0.0429  0.1490 -0.1970   1.00000
## alcohol                           -0.2060 -0.4960  0.2060   0.09360
## quality                           -0.1850 -0.1750 -0.0577   0.25100
##                      alcohol quality
## fixed.acidity        -0.0617  0.1240
## volatile.acidity     -0.2020 -0.3910
## citric.acid           0.1100  0.2260
## residual.sugar        0.0421  0.0137
## chlorides            -0.2210 -0.1290
## free.sulfur.dioxide  -0.0694 -0.0507
## total.sulfur.dioxide -0.2060 -0.1850
## density              -0.4960 -0.1750
## pH                    0.2060 -0.0577
## sulphates             0.0936  0.2510
## alcohol               1.0000  0.4760
## quality               0.4760  1.0000
#corrPairs is a correlation visualization function (modified code from lectures)
corrPairs(red.wine.cr)

#Correlation matrix
red.wine.log.cr=cor(red.wine.log,use="complete")
signif(red.wine.log.cr,3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000          -0.2610     0.66900
## volatile.acidity            -0.261           1.0000    -0.56400
## citric.acid                  0.669          -0.5640     1.00000
## residual.sugar               0.159           0.0242     0.16800
## chlorides                    0.120           0.0726     0.20200
## free.sulfur.dioxide         -0.178           0.0207    -0.08780
## total.sulfur.dioxide        -0.114           0.0841    -0.00255
## density                      0.674           0.0300     0.36500
## pH                          -0.704           0.2320    -0.54400
## sulphates                    0.191          -0.2830     0.32400
## alcohol                     -0.090          -0.2140     0.10900
## quality                      0.116          -0.3930     0.22600
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                0.1590   0.12000            -0.17800
## volatile.acidity             0.0242   0.07260             0.02070
## citric.acid                  0.1680   0.20200            -0.08780
## residual.sugar               1.0000   0.05590             0.10000
## chlorides                    0.0559   1.00000            -0.00557
## free.sulfur.dioxide          0.1000  -0.00557             1.00000
## total.sulfur.dioxide         0.1540   0.06220             0.78400
## density                      0.4060   0.21900            -0.03960
## pH                          -0.0896  -0.27300             0.09580
## sulphates                    0.0156   0.33800             0.05530
## alcohol                      0.0751  -0.23600            -0.08320
## quality                      0.0217  -0.13500            -0.05030
##                      total.sulfur.dioxide density      pH sulphates
## fixed.acidity                    -0.11400  0.6740 -0.7040    0.1910
## volatile.acidity                  0.08410  0.0300  0.2320   -0.2830
## citric.acid                      -0.00255  0.3650 -0.5440    0.3240
## residual.sugar                    0.15400  0.4060 -0.0896    0.0156
## chlorides                         0.06220  0.2190 -0.2730    0.3380
## free.sulfur.dioxide               0.78400 -0.0396  0.0958    0.0553
## total.sulfur.dioxide              1.00000  0.1040 -0.0171    0.0593
## density                           0.10400  1.0000 -0.3410    0.1570
## pH                               -0.01710 -0.3410  1.0000   -0.1840
## sulphates                         0.05930  0.1570 -0.1840    1.0000
## alcohol                          -0.23700 -0.4920  0.2030    0.1150
## quality                          -0.17100 -0.1750 -0.0576    0.2810
##                      alcohol quality
## fixed.acidity        -0.0900  0.1160
## volatile.acidity     -0.2140 -0.3930
## citric.acid           0.1090  0.2260
## residual.sugar        0.0751  0.0217
## chlorides            -0.2360 -0.1350
## free.sulfur.dioxide  -0.0832 -0.0503
## total.sulfur.dioxide -0.2370 -0.1710
## density              -0.4920 -0.1750
## pH                    0.2030 -0.0576
## sulphates             0.1150  0.2810
## alcohol               1.0000  0.4770
## quality               0.4770  1.0000
#Correlation visualization of log-transformed data
corrPairs(red.wine.log.cr)

#Finding collinearity 
vif(lm(quality~.,data=red.wine.log))
##        fixed.acidity     volatile.acidity          citric.acid 
##             8.176442             1.832439             3.091536 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             1.897230             1.447292             2.875551 
## total.sulfur.dioxide              density                   pH 
##             3.161360             7.110926             3.546283 
##            sulphates              alcohol 
##             1.456824             3.187275

There is weak negative correlation (-0.39) between the “quality” attribute and “volatile.acidity”, and positive correlation (0.48) between “quality” and “alcohol”.

There is significant correlation between “fixed.acidity” and “citric.acid” (0.67), “density” (0.67), and “pH” (-0.70). Also, “total.sulfur.dioxide” and “free.sulfur.dioxide” are correlated (0.78).

Collinearity is acceptable (all VIF values are below 10).
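As a reminder of what these VIF numbers measure: VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. A self-contained sketch on made-up variables (`x1`, `x2`, `x3` are toy names, not wine attributes):

```r
set.seed(42)
# Two nearly collinear predictors plus an independent one
dat <- data.frame(x1 = rnorm(200))
dat$x2 <- dat$x1 + rnorm(200, sd = 0.2)  # almost a copy of x1
dat$x3 <- rnorm(200)

# VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing
# predictor j on all the other predictors
vif.by.hand <- sapply(names(dat), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(dat), v), v), dat))$r.squared
  1 / (1 - r2)
})
round(vif.by.hand, 1)  # x1, x2 well above 10; x3 near 1
```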

White Wine:

#Load white.wine data
white.wine = read.csv(file="winequality-white.csv",header=TRUE,sep=";")
#Dimensions of data
dim(white.wine)
## [1] 4898   12
#List first 6 rows with headers
head(white.wine)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.0             0.27        0.36           20.7     0.045
## 2           6.3             0.30        0.34            1.6     0.049
## 3           8.1             0.28        0.40            6.9     0.050
## 4           7.2             0.23        0.32            8.5     0.058
## 5           7.2             0.23        0.32            8.5     0.058
## 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
#Summarizing data
summary(white.wine)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
#Boxplotting each attribute
old.par <- par(mar=c(1,1,1,1),mfrow=c(1,12),ps=6)
for(i in colnames(white.wine)){
boxplot(white.wine[,i],main=i)
}

par(old.par)

#Pair/Plot all 12 attributes of untransformed data
pairs(white.wine,pch=19,cex=white.wine$quality^3/1000, col=hsv(50/360, 1, 1-white.wine$quality/10))

#pairs(white.wine,pch=19,cex=.3, col=white.wine$quality)
#Pair/Plot all 12 attributes of log-transformed data
#pairs(white.wine+1,pch=19,cex=.3,log="xy", col=white.wine$quality)
pairs(white.wine+1,pch=19,cex=white.wine$quality^3/1000,log="xy", col=hsv(50/360, 1, 1-white.wine$quality/10))

#Outlier detection
#plot(white.wine[,c(5,1)])
#identify(white.wine[,c(5,1)])

#old.par<-par(mfrow=c(2,2))
#plot(lm(white.wine$quality~white.wine$density+white.wine$chlorides, white.wine))

#Row 2782 is an outlier, but it does not have a large effect on the model white.wine$quality~white.wine[,1]+white.wine[,5]

#Pair/Plot all 12 attributes of log-transformed data without the outliers
#(subset first, so that cex and col match the plotted rows)
white.wine.sub = subset(white.wine, density<1.0102)
pairs(white.wine.sub+1,pch=19,cex=white.wine.sub$quality^3/1000,log="xy", col=hsv(50/360, 1, 1-white.wine.sub$quality/10))

#Checking for zeros or NAs in the data
#length(white.wine[white.wine == 0])
#length(which(is.na(white.wine)))

#Check if log-transformation improves the data
old.par <- par(mar=c(1,1,1,1), mfrow=c(6,4),ps=6)
for ( col.n in colnames(white.wine)[1:6] ) {
  
  hist(white.wine[,col.n],col='darkgoldenrod1',xlab=col.n, main=col.n)
  qqnorm(white.wine[,col.n])
  qqline(white.wine[,col.n], col="red")
  hist(log(white.wine[,col.n]+1),col='lightblue',xlab=col.n, main=col.n)
  qqnorm(log(white.wine[,col.n]+1))
  qqline(log(white.wine[,col.n]+1), col="red")
}

par(old.par)
#Next 6 attributes
old.par <- par(mar=c(1,1,1,1), mfrow=c(6,4),ps=6)
for ( col.n in colnames(white.wine)[7:12] ) {
  
  hist(white.wine[,col.n],col='darkgoldenrod1',xlab=col.n, main=col.n)
  qqnorm(white.wine[,col.n])
  qqline(white.wine[,col.n], col="red")
  hist(log(white.wine[,col.n]+1),col='lightblue',xlab=col.n, main=col.n)
  qqnorm(log(white.wine[,col.n]+1))
  qqline(log(white.wine[,col.n]+1), col="red")
}

par(old.par)

Log-transformation improves all data attributes except “total.sulfur.dioxide” and “quality”; it is justified for the other attributes.

#Log-transforming all attributes except "total.sulfur.dioxide" and "quality"
white.wine.log = white.wine
for(cn in colnames(white.wine.log[,!names(white.wine.log) %in% c("total.sulfur.dioxide","quality")])){
  white.wine.log[,cn] = log(white.wine[,cn]+1)
  #colnames(white.wine.log)[names(white.wine.log) == cn] = paste0("L.",cn)
}
#Correlation matrix
white.wine.cr=cor(white.wine,use="complete") 
signif(white.wine.cr,3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity               1.0000          -0.0227     0.28900
## volatile.acidity           -0.0227           1.0000    -0.14900
## citric.acid                 0.2890          -0.1490     1.00000
## residual.sugar              0.0890           0.0643     0.09420
## chlorides                   0.0231           0.0705     0.11400
## free.sulfur.dioxide        -0.0494          -0.0970     0.09410
## total.sulfur.dioxide        0.0911           0.0893     0.12100
## density                     0.2650           0.0271     0.15000
## pH                         -0.4260          -0.0319    -0.16400
## sulphates                  -0.0171          -0.0357     0.06230
## alcohol                    -0.1210           0.0677    -0.07570
## quality                    -0.1140          -0.1950    -0.00921
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                0.0890    0.0231           -0.049400
## volatile.acidity             0.0643    0.0705           -0.097000
## citric.acid                  0.0942    0.1140            0.094100
## residual.sugar               1.0000    0.0887            0.299000
## chlorides                    0.0887    1.0000            0.101000
## free.sulfur.dioxide          0.2990    0.1010            1.000000
## total.sulfur.dioxide         0.4010    0.1990            0.616000
## density                      0.8390    0.2570            0.294000
## pH                          -0.1940   -0.0904           -0.000618
## sulphates                   -0.0267    0.0168            0.059200
## alcohol                     -0.4510   -0.3600           -0.250000
## quality                     -0.0976   -0.2100            0.008160
##                      total.sulfur.dioxide density        pH sulphates
## fixed.acidity                     0.09110  0.2650 -0.426000   -0.0171
## volatile.acidity                  0.08930  0.0271 -0.031900   -0.0357
## citric.acid                       0.12100  0.1500 -0.164000    0.0623
## residual.sugar                    0.40100  0.8390 -0.194000   -0.0267
## chlorides                         0.19900  0.2570 -0.090400    0.0168
## free.sulfur.dioxide               0.61600  0.2940 -0.000618    0.0592
## total.sulfur.dioxide              1.00000  0.5300  0.002320    0.1350
## density                           0.53000  1.0000 -0.093600    0.0745
## pH                                0.00232 -0.0936  1.000000    0.1560
## sulphates                         0.13500  0.0745  0.156000    1.0000
## alcohol                          -0.44900 -0.7800  0.121000   -0.0174
## quality                          -0.17500 -0.3070  0.099400    0.0537
##                      alcohol  quality
## fixed.acidity        -0.1210 -0.11400
## volatile.acidity      0.0677 -0.19500
## citric.acid          -0.0757 -0.00921
## residual.sugar       -0.4510 -0.09760
## chlorides            -0.3600 -0.21000
## free.sulfur.dioxide  -0.2500  0.00816
## total.sulfur.dioxide -0.4490 -0.17500
## density              -0.7800 -0.30700
## pH                    0.1210  0.09940
## sulphates            -0.0174  0.05370
## alcohol               1.0000  0.43600
## quality               0.4360  1.00000
#Correlation matrix visualized
corrPairs(white.wine.cr)

#Correlation matrix of log-transformed data
white.wine.log.cr=cor(white.wine.log,use="complete")
signif(white.wine.log.cr,3)
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity               1.0000          -0.0306     0.30400
## volatile.acidity           -0.0306           1.0000    -0.17100
## citric.acid                 0.3040          -0.1710     1.00000
## residual.sugar              0.0874           0.0900     0.07100
## chlorides                   0.0341           0.0682     0.10700
## free.sulfur.dioxide        -0.0465          -0.1130     0.08690
## total.sulfur.dioxide        0.1010           0.0967     0.12000
## density                     0.2760           0.0253     0.14600
## pH                         -0.4350          -0.0346    -0.16600
## sulphates                  -0.0153          -0.0373     0.06720
## alcohol                    -0.1250           0.0577    -0.06890
## quality                    -0.1100          -0.1960     0.00599
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                0.0874    0.0341             -0.0465
## volatile.acidity             0.0900    0.0682             -0.1130
## citric.acid                  0.0710    0.1070              0.0869
## residual.sugar               1.0000    0.0836              0.3260
## chlorides                    0.0836    1.0000              0.0957
## free.sulfur.dioxide          0.3260    0.0957              1.0000
## total.sulfur.dioxide         0.4210    0.2070              0.6000
## density                      0.7780    0.2690              0.2850
## pH                          -0.1840   -0.0906              0.0217
## sulphates                   -0.0324    0.0237              0.0631
## alcohol                     -0.4250   -0.3750             -0.2310
## quality                     -0.0753   -0.2160              0.0949
##                      total.sulfur.dioxide density       pH sulphates
## fixed.acidity                     0.10100  0.2760 -0.43500   -0.0153
## volatile.acidity                  0.09670  0.0253 -0.03460   -0.0373
## citric.acid                       0.12000  0.1460 -0.16600    0.0672
## residual.sugar                    0.42100  0.7780 -0.18400   -0.0324
## chlorides                         0.20700  0.2690 -0.09060    0.0237
## free.sulfur.dioxide               0.60000  0.2850  0.02170    0.0631
## total.sulfur.dioxide              1.00000  0.5300  0.00347    0.1430
## density                           0.53000  1.0000 -0.09480    0.0823
## pH                                0.00347 -0.0948  1.00000    0.1580
## sulphates                         0.14300  0.0823  0.15800    1.0000
## alcohol                          -0.45400 -0.7860  0.12900   -0.0252
## quality                          -0.17500 -0.3070  0.09940    0.0484
##                      alcohol  quality
## fixed.acidity        -0.1250 -0.11000
## volatile.acidity      0.0577 -0.19600
## citric.acid          -0.0689  0.00599
## residual.sugar       -0.4250 -0.07530
## chlorides            -0.3750 -0.21600
## free.sulfur.dioxide  -0.2310  0.09490
## total.sulfur.dioxide -0.4540 -0.17500
## density              -0.7860 -0.30700
## pH                    0.1290  0.09940
## sulphates            -0.0252  0.04840
## alcohol               1.0000  0.43200
## quality               0.4320  1.00000
corrPairs(white.wine.log.cr)

There is weak negative correlation (-0.31) between the “quality” attribute and “density”, and positive correlation (0.43) between “quality” and “alcohol”.

There is negative correlation between “alcohol” and “density” (-0.79), and weak negative correlation between “alcohol” and both “residual.sugar” and “total.sulfur.dioxide” (-0.43 and -0.45, respectively). There is strong positive correlation between “residual.sugar” and “density” (0.78), and positive correlation between “total.sulfur.dioxide” and “free.sulfur.dioxide” (0.60).

Because of the strong correlations between “residual.sugar” and “density” (0.78) and between “alcohol” and “density” (-0.79), dropping the “density” attribute should let us avoid collinearity problems.

#Finding collinearity
vif(lm(quality~.,data=white.wine.log))
##        fixed.acidity     volatile.acidity          citric.acid 
##             1.965851             1.159768             1.183259 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             5.246982             1.222915             1.751950 
## total.sulfur.dioxide              density                   pH 
##             2.176463            12.033690             1.690311 
##            sulphates              alcohol 
##             1.103493             4.883773

VIF value for “density” is greater than 10. The attribute “density” will be excluded.

corrPairs(cor(white.wine.log,use="complete"))

After excluding “density” (below), all VIF values are close to 1, with the largest around 2.2.

#Excluding "density" attribute
white.wine.log = subset(white.wine.log, select=-c(density))
#Checking collinearity
vif(lm(quality~.,data=white.wine.log))
##        fixed.acidity     volatile.acidity          citric.acid 
##             1.375078             1.158206             1.175535 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##             1.440990             1.215371             1.727452 
## total.sulfur.dioxide                   pH            sulphates 
##             2.152719             1.340710             1.062249 
##              alcohol 
##             1.638456

Sub-problem 2: choose optimal models by exhaustive, forward and backward selection

Use regsubsets from library leaps to choose the optimal set of variables for modeling wine quality for red and white wine (separately), and describe the differences and similarities between the attributes deemed important in each case.

Red Wine:

#Finding values for metrics("rsq","rss","adjr2","cp","bic") with three methods ("exhaustive", "backward", "forward") for different number of attributes in model
summaryMetrics <- NULL
whichAll <- list()
regsubsetsAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward") ) {
  rsRes <- regsubsets(quality~.,red.wine.log,method=myMthd,nvmax=11,really.big = T)
  regsubsetsAll[[myMthd]] <- rsRes
  summRes <- summary(rsRes)
  whichAll[[myMthd]] <- summRes$which
  for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
    summaryMetrics <- rbind(summaryMetrics,
      data.frame(method=myMthd,metric=metricName,
                nvars=1:length(summRes[[metricName]]),
                value=summRes[[metricName]]))
  }
}
#Plotting results
ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") +   theme(legend.position="top")

All three variable selection methods, when applied to the entire dataset, yield models with very similar fit metrics. For all metrics except BIC and Cp, an increase in the number of variables appears to result in progressive improvement of the fit. BIC reaches its minimum with 7 of the 11 variables in the model, Cp with 9. Adding the 2nd and 3rd variables improves the fit significantly; from there up to the 7th variable the improvement is slight.
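The minima quoted above can be read off programmatically with `which.min()` (in the report, `which.min(summRes$bic)` and `which.min(summRes$cp)` on the `regsubsets` summary do exactly this). A base-R sketch on the built-in `mtcars` data, with an arbitrary made-up variable ordering:

```r
# BIC of nested models of increasing size; which.min() gives the best size
vars <- c("wt", "hp", "qsec", "drat", "gear")  # arbitrary order for illustration
bics <- sapply(seq_along(vars), function(k)
  BIC(lm(reformulate(vars[1:k], "mpg"), data = mtcars)))
which.min(bics)  # number of variables at the BIC minimum
```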

Using 6 variables:

#Exhaustive method
red.wine.log.set6 = names(whichAll$exhaustive[6,whichAll$exhaustive[6,]==TRUE])[-1]
red.wine.log.set6
## [1] "volatile.acidity"     "chlorides"            "total.sulfur.dioxide"
## [4] "pH"                   "sulphates"            "alcohol"
#Backward method
names(whichAll$backward[6,whichAll$backward[6,]==TRUE])[-1]
## [1] "volatile.acidity"     "chlorides"            "total.sulfur.dioxide"
## [4] "pH"                   "sulphates"            "alcohol"
#Forward method
names(whichAll$forward[6,whichAll$forward[6,]==TRUE])[-1]
## [1] "volatile.acidity"     "chlorides"            "total.sulfur.dioxide"
## [4] "pH"                   "sulphates"            "alcohol"

All three methods select the same 6 variables for the model.
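
This agreement can also be confirmed programmatically (a small sketch using the whichAll list built above):

```r
#TRUE if exhaustive, backward and forward agree on the 6-variable model
all(whichAll$exhaustive[6,] == whichAll$backward[6,]) &&
  all(whichAll$exhaustive[6,] == whichAll$forward[6,])
```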

#Visualizing regsubset results
old.par <- par(mfrow=c(1,3),ps=10)
for ( myMthd in names(regsubsetsAll) ) {
  plot(regsubsetsAll[[myMthd]],main=myMthd)
}

par(old.par)

In this case, for any given number of variables, all three methods choose the same variable subset when applied to the red wine dataset.

The same conclusion can be drawn by visualizing variable membership in the models in the order of their size:

old.par <- par(mfrow=c(1,3),ps=10)
for ( myMthd in names(whichAll) ) {
  image(1:nrow(whichAll[[myMthd]]),
        1:ncol(whichAll[[myMthd]]),
        whichAll[[myMthd]],xlab="N(vars)",ylab="",
        xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
        col=c("white","gray"),main=myMthd)
  axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
  axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}

par(old.par)

#Predict method for regsubsets objects: rebuild the model matrix for
#newdata from the formula stored in the fitted object's call and
#multiply it by the coefficients of the model with id variables
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id=id)
  xvars <- names(coefi)
  mat[, xvars] %*% coefi
}

dfTmp <- NULL
whichSum <- array(0,dim=c(11,12,3),
  dimnames=list(NULL,colnames(model.matrix(quality~.,red.wine.log)),
      c("exhaustive", "backward", "forward")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(red.wine.log)))
  # Try each method available in regsubsets
  # to select the best model of each size:
  for ( jSelect in c("exhaustive", "backward", "forward") ) {
    rsTrain <- regsubsets(quality~.,red.wine.log[bTrain,],nvmax=11,method=jSelect)
    # Add up variable selections:
    whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
    # Calculate test error for each set of variables
    # using predict.regsubsets implemented above:
    for ( kVarSet in 1:11 ) {
      # make predictions:
      testPred <- predict(rsTrain,red.wine.log[!bTrain,],id=kVarSet)
      # calculate MSE:
      mseTest <- mean((testPred-red.wine.log[!bTrain,"quality"])^2)
      # add to data.frame for future plotting:
      dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
      mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
    }
  }
}

#Mean test MSE  
mean(dfTmp[dfTmp$vars == 6 & dfTmp$sel == "exhaustive" & dfTmp$trainTest == "test", "mse"])
## [1] 0.4284542
# plot MSEs by training/test, number of 
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)

White Wine:

#Finding values for metrics("rsq","rss","adjr2","cp","bic") with three methods ("exhaustive", "backward", "forward") for different number of attributes in model
summaryMetrics <- NULL
whichAll <- list()
regsubsetsAll <- list()
for ( myMthd in c("exhaustive", "backward", "forward") ) {
  rsRes <- regsubsets(quality~.,white.wine.log,method=myMthd,nvmax=10)
  regsubsetsAll[[myMthd]] <- rsRes
  summRes <- summary(rsRes)
  whichAll[[myMthd]] <- summRes$which
  for ( metricName in c("rsq","rss","adjr2","cp","bic") ) {
    summaryMetrics <- rbind(summaryMetrics,
      data.frame(method=myMthd,metric=metricName,
                nvars=1:length(summRes[[metricName]]),
                value=summRes[[metricName]]))
  }
}

ggplot(summaryMetrics,aes(x=nvars,y=value,shape=method,colour=method)) + geom_path() + geom_point() + facet_wrap(~metric,scales="free") +   theme(legend.position="top")

All three variable selection methods, when applied to the entire dataset, yield models with very similar fit metrics. For RSQ and RSS, increasing the number of variables results in progressive improvement of the fit. AdjR2, BIC and CP reach their best values with 6 of the 10 variables in the model. Adding the 2nd through 4th variables improves the fit significantly; further gains up to the 6th variable are slight.
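
As for the red wine, the criterion optima can be extracted directly from the summary object (assuming regsubsetsAll from the white wine chunk above):

```r
#Optimal model sizes for white wine by criterion (exhaustive search)
summExh <- summary(regsubsetsAll$exhaustive)
c(bic=which.min(summExh$bic), cp=which.min(summExh$cp),
  adjr2=which.max(summExh$adjr2))
```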

Using 5 variables:

#Exhaustive method
white.wine.log.set5 = names(whichAll$exhaustive[5,whichAll$exhaustive[5,]==TRUE])[-1]
white.wine.log.set5
## [1] "volatile.acidity"     "residual.sugar"       "free.sulfur.dioxide" 
## [4] "total.sulfur.dioxide" "alcohol"
#Backward method
names(whichAll$backward[5,whichAll$backward[5,]==TRUE])[-1]
## [1] "volatile.acidity"     "residual.sugar"       "free.sulfur.dioxide" 
## [4] "total.sulfur.dioxide" "alcohol"
#Forward method
names(whichAll$forward[5,whichAll$forward[5,]==TRUE])[-1]
## [1] "volatile.acidity"     "residual.sugar"       "free.sulfur.dioxide" 
## [4] "total.sulfur.dioxide" "alcohol"
#Visualizing regsubset results
old.par <- par(mfrow=c(1,3),ps=10)
for ( myMthd in names(regsubsetsAll) ) {
  plot(regsubsetsAll[[myMthd]],main=myMthd)
}

par(old.par)

All three variable selection methods choose the same variables when applied to the white wine dataset for a given variable number.

The same conclusion can be drawn by visualizing variable membership in the models in the order of their size:

old.par <- par(mfrow=c(1,3),ps=10)
for ( myMthd in names(whichAll) ) {
  image(1:nrow(whichAll[[myMthd]]),
        1:ncol(whichAll[[myMthd]]),
        whichAll[[myMthd]],xlab="N(vars)",ylab="",
        xaxt="n",yaxt="n",breaks=c(-0.5,0.5,1.5),
        col=c("white","gray"),main=myMthd)
  axis(1,1:nrow(whichAll[[myMthd]]),rownames(whichAll[[myMthd]]))
  axis(2,1:ncol(whichAll[[myMthd]]),colnames(whichAll[[myMthd]]),las=2)
}

par(old.par)

dfTmp <- NULL
whichSum <- array(0,dim=c(10,11,3),
  dimnames=list(NULL,colnames(model.matrix(quality~.,white.wine.log)),
      c("exhaustive", "backward", "forward")))
# Split data into training and test 30 times:
nTries <- 30
for ( iTry in 1:nTries ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(white.wine.log)))
  # Try each method available in regsubsets
  # to select the best model of each size:
  for ( jSelect in c("exhaustive", "backward", "forward") ) {
    rsTrain <- regsubsets(quality~.,white.wine.log[bTrain,],nvmax=10,method=jSelect)
    # Add up variable selections:
    whichSum[,,jSelect] <- whichSum[,,jSelect] + summary(rsTrain)$which
    # Calculate test error for each set of variables
    # using predict.regsubsets implemented above:
    for ( kVarSet in 1:10 ) {
      # make predictions:
      testPred <- predict(rsTrain,white.wine.log[!bTrain,],id=kVarSet)
      # calculate MSE:
      mseTest <- mean((testPred-white.wine.log[!bTrain,"quality"])^2)
      # add to data.frame for future plotting:
      dfTmp <- rbind(dfTmp,data.frame(sim=iTry,sel=jSelect,vars=kVarSet,
      mse=c(mseTest,summary(rsTrain)$rss[kVarSet]/sum(bTrain)),trainTest=c("test","train")))
    }
  }
}

#Mean test MSE  
mean(dfTmp[dfTmp$vars == 5 & dfTmp$sel == "exhaustive" & dfTmp$trainTest == "test", "mse"])
## [1] 0.5662291
# plot MSEs by training/test, number of 
# variables and selection method:
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest)

#Comparing attributes for red and white wine types
wine.model.table = table(c(red.wine.log.set6,white.wine.log.set5))
wine.model.table
## 
##              alcohol            chlorides  free.sulfur.dioxide 
##                    2                    1                    1 
##                   pH       residual.sugar            sulphates 
##                    1                    1                    1 
## total.sulfur.dioxide     volatile.acidity 
##                    2                    2

These attributes are used for both wine types:

#Both models for wine types use these 3 attributes
names(wine.model.table[wine.model.table==2])
## [1] "alcohol"              "total.sulfur.dioxide" "volatile.acidity"

For both wine types, “alcohol” and “volatile.acidity” are among the top three attributes entering the models; the third differs between the two: “sulphates” for red wine and “free.sulfur.dioxide” for white wine.
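
The whichSum arrays accumulated in the resampling loops above are never printed; for the white wine runs, the fraction of the 30 train/test splits in which each variable enters the best model of each size shows how stable these selections are (a sketch assuming whichSum and nTries are still in the workspace):

```r
#Selection frequency (0..1) of each variable by model size,
#exhaustive search over 30 random training halves (white wine)
round(whichSum[,,"exhaustive"] / nTries, 2)
```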

Sub-problem 3: optimal model by cross-validation

Use cross-validation (or any other resampling strategy of your choice) to estimate test error for models with different numbers of variables. Compare and comment on the number of variables deemed optimal by resampling versus those selected by regsubsets in the previous task. Compare resulting models built separately for red and white wine data.

Red Wine:

#Cross-validation function
xvalMSEregsubsetsWine <- function(wine,nTries=30,kXval=5) {
  retRes <- NULL
  for ( iTry in 1:nTries ) {
    xvalFolds <- sample(rep(1:kXval,length.out=nrow(wine)))
    # Try each method available in regsubsets
    # to select the best model of each size:
    for ( jSelect in c("exhaustive", "backward", "forward") ) {
      mthdTestErr2 <- NULL
      mthdTrainErr2 <- NULL
      mthdTestFoldErr2 <- NULL
      for ( kFold in 1:kXval ) {
        rsTrain <- regsubsets(quality~.,wine[xvalFolds!=kFold,],nvmax=(length(wine)-1),method=jSelect)
        # Calculate test error for each set of variables
        # using predict.regsubsets implemented above:
        nVarTestErr2 <- NULL
        nVarTrainErr2 <- NULL
        for ( kVarSet in 1:(length(wine)-1) ) {
          # make predictions for given number of variables and cross-validation fold:
          kCoef <- coef(rsTrain,id=kVarSet)
          testPred <- model.matrix(quality~.,wine[xvalFolds==kFold,])[,names(kCoef)] %*% kCoef
          nVarTestErr2 <- cbind(nVarTestErr2,(testPred-wine[xvalFolds==kFold,"quality"])^2)
          trainPred <- model.matrix(quality~.,wine[xvalFolds!=kFold,])[,names(kCoef)] %*% kCoef
          nVarTrainErr2 <- cbind(nVarTrainErr2,(trainPred-wine[xvalFolds!=kFold,"quality"])^2)
        }
        # accumulate training and test errors over all cross-validation folds:
        mthdTestErr2 <- rbind(mthdTestErr2,nVarTestErr2)
        mthdTestFoldErr2 <- rbind(mthdTestFoldErr2,colMeans(nVarTestErr2))
        mthdTrainErr2 <- rbind(mthdTrainErr2,nVarTrainErr2)
      }
      #cat(kXval,iTry,jSelect,dim(mthdTrainErr2),dim(mthdTestErr2),fill=TRUE)
      # add to data.frame for future plotting:
      retRes <- rbind(retRes,
                    data.frame(sim=iTry,sel=jSelect,vars=1:ncol(nVarTrainErr2),mse=colMeans(mthdTrainErr2),trainTest="train"),
                    data.frame(sim=iTry,sel=jSelect,vars=1:ncol(nVarTrainErr2),mse=colMeans(mthdTestErr2),trainTest="test"))
      #for ( iRow in 1:nrow(mthdTestFoldErr2) ) {
      #  retRes <- rbind(retRes,data.frame(sim=iTry,sel=jSelect,vars=1:ncol(mthdTestFoldErr2),mse=mthdTestFoldErr2[iRow,],trainTest="testFold"))
      #}
    }
  }
  retRes
}

#Plotting MSEs by training/test, number of variables:
dfTmp <- rbind(data.frame(xvalMSEregsubsetsWine(red.wine.log,30,kXval=2),xval="2-fold"),
               data.frame(xvalMSEregsubsetsWine(red.wine.log,30,kXval=5),xval="5-fold"),
               data.frame(xvalMSEregsubsetsWine(red.wine.log,30,kXval=10),xval="10-fold"))
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest+xval)

From the ggplot we can see that adding the 2nd and 3rd attributes decreases MSE significantly. The three methods yield models of very comparable performance. Variability of the test error is higher than that of the training error. A model with 3 or 4 variables could be justified.
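
The visual impression can be summarized numerically by averaging the test MSE over all simulations, folds and methods for each model size (assuming dfTmp from the chunk above):

```r
#Mean test MSE by number of variables (red wine, all CV settings pooled)
with(subset(dfTmp, trainTest == "test"), round(tapply(mse, vars, mean), 4))
```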

Detecting best 3 attributes based on CV results:

#Listing attributes of the optimal model
rs.red.wine.log=regsubsets(quality~.,red.wine.log, nvmax=11)
rs.red.wine.summary <- as.data.frame(summary(rs.red.wine.log)$outmat)
rs.red.tmp=rs.red.wine.summary[3,]
red.wine.log.rs3 = names(rs.red.tmp[,rs.red.tmp[nrow(rs.red.tmp),]=="*",drop=FALSE])
red.wine.log.rs3
## [1] "volatile.acidity" "sulphates"        "alcohol"
#Both regsubsets and CV select these 3 attributes for the model
red.wine.set.rs.table = table(c(red.wine.log.set6,red.wine.log.rs3))

names(red.wine.set.rs.table[red.wine.set.rs.table==2])
## [1] "alcohol"          "sulphates"        "volatile.acidity"
red.wine.lm.rs3 = lm(quality~.,red.wine.log[,c(red.wine.log.rs3, "quality")])
red.wine.lm.set6 = lm(quality~.,red.wine.log[,c(red.wine.log.set6, "quality")])

#MSE of the optimal model with CV
mean((red.wine.log[,"quality"] - predict(red.wine.lm.rs3))^2)
## [1] 0.4300604
#MSE of the optimal model with regsubset
mean((red.wine.log[,"quality"] - predict(red.wine.lm.set6))^2)
## [1] 0.4190814
#Summarizing CV model
summary(red.wine.lm.rs3)
## 
## Call:
## lm(formula = quality ~ ., data = red.wine.log[, c(red.wine.log.rs3, 
##     "quality")])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.74451 -0.37492 -0.05681  0.47051  2.14453 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.0291     0.4763  -6.359 2.65e-10 ***
## volatile.acidity  -1.8406     0.1518 -12.127  < 2e-16 ***
## sulphates          1.3830     0.1830   7.557 6.93e-14 ***
## alcohol            3.5945     0.1862  19.300  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6566 on 1595 degrees of freedom
## Multiple R-squared:  0.3402, Adjusted R-squared:  0.3389 
## F-statistic: 274.1 on 3 and 1595 DF,  p-value: < 2.2e-16
#Summarizing regsubset model
summary(red.wine.lm.set6)
## 
## Call:
## lm(formula = quality ~ ., data = red.wine.log[, c(red.wine.log.set6, 
##     "quality")])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6083 -0.3535 -0.0511  0.4602  1.9028 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.26760    0.80716   0.332 0.740283    
## volatile.acidity     -1.56577    0.15782  -9.921  < 2e-16 ***
## chlorides            -2.38223    0.47878  -4.976 7.21e-07 ***
## total.sulfur.dioxide -0.08178    0.02465  -3.317 0.000930 ***
## pH                   -1.76186    0.50136  -3.514 0.000453 ***
## sulphates             1.74939    0.19874   8.802  < 2e-16 ***
## alcohol               3.37798    0.20138  16.774  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6488 on 1592 degrees of freedom
## Multiple R-squared:  0.357,  Adjusted R-squared:  0.3546 
## F-statistic: 147.3 on 6 and 1592 DF,  p-value: < 2.2e-16

Both models return comparable results, with RSE about 0.65 (on 1595 and 1592 degrees of freedom, respectively) and AdjR2 of about 0.34 for the 3-variable and 0.35 for the 6-variable model.

The coefficient values differ between the two models, which may be due to hidden collinearity among the predictors.
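
One way to probe for such collinearity is to look at pairwise correlations among the selected predictors (a sketch using red.wine.log.set6 defined earlier; strongly correlated pairs would explain the shifting coefficient estimates):

```r
#Pairwise correlations among the six predictors of the regsubsets model
round(cor(red.wine.log[, red.wine.log.set6]), 2)
```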

White Wine:

#Plotting MSEs by training/test, number of variables:
dfTmp <- rbind(data.frame(xvalMSEregsubsetsWine(white.wine.log,30,kXval=2),xval="2-fold"),
               data.frame(xvalMSEregsubsetsWine(white.wine.log,30,kXval=5),xval="5-fold"),
               data.frame(xvalMSEregsubsetsWine(white.wine.log,30,kXval=10),xval="10-fold"))
ggplot(dfTmp,aes(x=factor(vars),y=mse,colour=sel)) + geom_boxplot()+facet_wrap(~trainTest+xval)

From the ggplot we can see that adding the 2nd and 3rd attributes decreases MSE significantly. The three methods yield models of very comparable performance. Variability of the test error is higher than that of the training error. A model with 3 or 4 variables could be justified.

Detecting best 3 attributes based on CV results:

#Listing attributes of the optimal model
rs.white.wine.log=regsubsets(quality~.,white.wine.log, nvmax=11)
rs.white.wine.summary <- as.data.frame(summary(rs.white.wine.log)$outmat)
rs.white.tmp=rs.white.wine.summary[3,]
white.wine.log.rs3 = names(rs.white.tmp[,rs.white.tmp[nrow(rs.white.tmp),]=="*",drop=FALSE])
white.wine.log.rs3
## [1] "volatile.acidity"    "free.sulfur.dioxide" "alcohol"
white.wine.set.rs.table = table(c(white.wine.log.set5,white.wine.log.rs3))

names(white.wine.set.rs.table[white.wine.set.rs.table==2])
## [1] "alcohol"             "free.sulfur.dioxide" "volatile.acidity"
white.wine.lm.rs3 = lm(quality~.,white.wine.log[,c(white.wine.log.rs3, "quality")])
white.wine.lm.set5 = lm(quality~.,white.wine.log[,c(white.wine.log.set5, "quality")])

#MSE of the optimal model with CV
mean((white.wine.log[,"quality"] - predict(white.wine.lm.rs3))^2)
## [1] 0.5746447
#MSE of the optimal model with regsubset
mean((white.wine.log[,"quality"] - predict(white.wine.lm.set5))^2)
## [1] 0.561256
#Summarizing CV model
summary(white.wine.lm.rs3)
## 
## Call:
## lm(formula = quality ~ ., data = white.wine.log[, c(white.wine.log.rs3, 
##     "quality")])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6117 -0.4986 -0.0470  0.4744  3.1645 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.61910    0.28708  -16.09   <2e-16 ***
## volatile.acidity    -2.38976    0.14510  -16.47   <2e-16 ***
## free.sulfur.dioxide  0.31334    0.02147   14.60   <2e-16 ***
## alcohol              4.09708    0.10603   38.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7584 on 4894 degrees of freedom
## Multiple R-squared:  0.2672, Adjusted R-squared:  0.2668 
## F-statistic: 594.9 on 3 and 4894 DF,  p-value: < 2.2e-16
#Summarizing regsubset model
summary(white.wine.lm.set5)
## 
## Call:
## lm(formula = quality ~ ., data = white.wine.log[, c(white.wine.log.set5, 
##     "quality")])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5452 -0.5005 -0.0362  0.4571  3.0220 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.9905752  0.3177966 -15.704  < 2e-16 ***
## volatile.acidity     -2.3813164  0.1489658 -15.986  < 2e-16 ***
## residual.sugar        0.1634996  0.0173281   9.435  < 2e-16 ***
## free.sulfur.dioxide   0.3580011  0.0266641  13.426  < 2e-16 ***
## total.sulfur.dioxide -0.0023278  0.0003581  -6.500 8.83e-11 ***
## alcohol               4.1995906  0.1215869  34.540  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7496 on 4892 degrees of freedom
## Multiple R-squared:  0.2843, Adjusted R-squared:  0.2836 
## F-statistic: 388.6 on 5 and 4892 DF,  p-value: < 2.2e-16

Both models return comparable results, with RSE about 0.75 (on 4894 and 4892 degrees of freedom, respectively) and AdjR2 of about 0.27 for the 3-variable and 0.28 for the 5-variable model.

Coefficients of the two models are comparable.

Sub-problem 4: lasso/ridge

Use regularized approaches (i.e. lasso and ridge) to model quality of red and white wine (separately). Compare resulting models (in terms of number of variables and their effects) to those selected in the previous two tasks (by regsubsets and resampling), comment on differences and similarities among them.

Red Wine:

Lasso

x <- model.matrix(quality~.,red.wine.log)[,-1]
y <- red.wine.log[,"quality"]
lassoRes <- glmnet(x,y,alpha=1)
#plot(lassoRes)

#head(x)
#head(y)

#Choosing λ by cross-validation over the default λ sequence
cvLassoRes <- cv.glmnet(x,y,alpha=1)
plot(cvLassoRes)

#cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-200:20)/80))
#plot(cvLassoRes)

predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           38.23272932
## fixed.acidity          0.45110221
## volatile.acidity      -1.68217813
## citric.acid           -0.26814100
## residual.sugar         0.09214019
## chlorides             -2.04553674
## free.sulfur.dioxide    0.10282436
## total.sulfur.dioxide  -0.14706971
## density              -56.59623090
## pH                    -1.23353127
## sulphates              1.81496699
## alcohol                3.09630318
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                               1
## (Intercept)          -1.5380136
## fixed.acidity         .        
## volatile.acidity     -1.4827296
## citric.acid           .        
## residual.sugar        .        
## chlorides             .        
## free.sulfur.dioxide   .        
## total.sulfur.dioxide  .        
## density               .        
## pH                    .        
## sulphates             0.8686966
## alcohol               3.0258839
#The optimal (in the min-1SE sense) lasso model retains only three variables: volatile.acidity, sulphates and alcohol

lassoResScaled <- glmnet(scale(x),y,alpha=1)
cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           5.636022514
## fixed.acidity         .          
## volatile.acidity     -0.173858630
## citric.acid           .          
## residual.sugar        .          
## chlorides             .          
## free.sulfur.dioxide   .          
## total.sulfur.dioxide -0.003127056
## density               .          
## pH                    .          
## sulphates             0.086027559
## alcohol               0.277475428

Lasso on red wine returns results comparable to the 6-variable regsubsets model, with the difference that “fixed.acidity” tends to enter in place of “pH”. Test MSE is slightly higher, 0.45 (vs 0.43). The three variables chosen by resampling (“alcohol”, “sulphates”, “volatile.acidity”) also have the largest lasso coefficients.
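
The difference between the two conventional choices of λ can be made explicit by counting non-zero coefficients at each (assuming lassoRes and cvLassoRes from the chunks above):

```r
#Number of variables retained by lasso at lambda.min vs lambda.1se
nzCount <- function(s) sum(predict(lassoRes, type="coefficients", s=s)[-1,1] != 0)
c(lambda.min=nzCount(cvLassoRes$lambda.min),
  lambda.1se=nzCount(cvLassoRes$lambda.1se))
```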

lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
  cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
  lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
  lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
  lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
  lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
  lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 0.453249
quantile(lassoMSE)
##        0%       25%       50%       75%      100% 
## 0.4196183 0.4377595 0.4544767 0.4642479 0.4999931
lassoCoefCnt
##        fixed.acidity     volatile.acidity          citric.acid 
##                    6                   30                    1 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                    0                    4                    0 
## total.sulfur.dioxide              density                   pH 
##                    8                    0                    4 
##            sulphates              alcohol 
##                   30                   30

Ridge

ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes)

cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)

cvRidgeRes$lambda.min
## [1] 0.0422638
cvRidgeRes$lambda.1se
## [1] 0.4325831
#With default λ’s the lowest MSE is attained for the least regularized model (for the lowest λ)

#cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-50:60)/20))
#plot(cvRidgeRes)

#cvRidgeRes$lambda.min

#cvRidgeRes$lambda.1se

predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           47.06938544
## fixed.acidity          0.43923453
## volatile.acidity      -1.57485179
## citric.acid           -0.16220682
## residual.sugar         0.10269990
## chlorides             -2.04327675
## free.sulfur.dioxide    0.08804357
## total.sulfur.dioxide  -0.13695454
## density              -69.16741864
## pH                    -1.00235782
## sulphates              1.76033582
## alcohol                2.88585157
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)           46.30667019
## fixed.acidity          0.23475300
## volatile.acidity      -1.14508315
## citric.acid            0.15464683
## residual.sugar         0.06651849
## chlorides             -1.56800056
## free.sulfur.dioxide    0.02536607
## total.sulfur.dioxide  -0.08589000
## density              -65.73562003
## pH                    -0.46594150
## sulphates              1.25020851
## alcohol                2.07731031
#As expected, for more regularized model (using 1SE rule) coefficients are smaller by absolute value than those at the minimum of MSE

ridgeResScaled <- glmnet(scale(x),y,alpha=0)
cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0,lambda=10^((-50:60)/20))
predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                1
## (Intercept)           5.63602251
## fixed.acidity         0.03725670
## volatile.acidity     -0.12270267
## citric.acid           0.03421604
## residual.sugar        0.01587369
## chlorides            -0.05614895
## free.sulfur.dioxide   0.01134691
## total.sulfur.dioxide -0.05360908
## density              -0.05890823
## pH                   -0.01458087
## sulphates             0.10722510
## alcohol               0.17333858

The three largest standardized ridge coefficients belong to “alcohol”, “volatile.acidity” and “sulphates”, which are also the three variables chosen by resampling. Intercepts for the lasso and ridge methods are comparable, and test MSE is comparable to lasso.
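
Ranking the standardized coefficients by absolute value makes the comparison with the regsubsets ordering direct (assuming ridgeResScaled and cvRidgeResScaled from the chunk above):

```r
#Predictors ordered by absolute standardized ridge coefficient
ridgeCoefs <- predict(ridgeResScaled, type="coefficients",
                      s=cvRidgeResScaled$lambda.1se)[-1,1]
sort(abs(ridgeCoefs), decreasing=TRUE)
```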

ridgeCoefCnt <- 0
ridgeCoefAve <- 0
ridgeMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
  cvridgeTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
  ridgeTrain <- glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
  ridgeTrainCoef <- predict(ridgeTrain,type="coefficients",s=cvridgeTrain$lambda.1se)
  ridgeCoefCnt <- ridgeCoefCnt + (ridgeTrainCoef[-1,1]!=0)
  ridgeCoefAve <- ridgeCoefAve + ridgeTrainCoef[-1,1]
  ridgeTestPred <- predict(ridgeTrain,newx=x[!bTrain,],s=cvridgeTrain$lambda.1se)
  ridgeMSE <- c(ridgeMSE,mean((ridgeTestPred-y[!bTrain])^2))
}
ridgeCoefAve <- ridgeCoefAve / length(ridgeMSE)

ridgeCoefAve
##        fixed.acidity     volatile.acidity          citric.acid 
##           0.22034365          -1.06909608           0.17773746 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##           0.06716620          -1.51888641           0.02071008 
## total.sulfur.dioxide              density                   pH 
##          -0.07601992         -60.13585517          -0.47194907 
##            sulphates              alcohol 
##           1.15435841           1.92691904
mean(ridgeMSE)
## [1] 0.4500541
quantile(ridgeMSE)
##        0%       25%       50%       75%      100% 
## 0.4002306 0.4393771 0.4516351 0.4596842 0.4903137

White Wine:

Lasso

x <- model.matrix(quality~.,white.wine.log)[,-1]
y <- white.wine.log[,"quality"]
lassoRes <- glmnet(x,y,alpha=1)
plot(lassoRes)

#Choosing λ by cross-validation over the default λ sequence
cvLassoRes <- cv.glmnet(x,y,alpha=1)
#plot(cvLassoRes)

#cvLassoRes <- cv.glmnet(x,y,alpha=1,lambda=10^((-200:20)/80))
#plot(cvLassoRes)

predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.min)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)          -5.116687416
## fixed.acidity        -0.286411920
## volatile.acidity     -2.309866552
## citric.acid           0.046307121
## residual.sugar        0.171507922
## chlorides            -1.194220914
## free.sulfur.dioxide   0.346725554
## total.sulfur.dioxide -0.002465644
## pH                    0.635859902
## sulphates             0.670474810
## alcohol               4.037978134
predict(lassoRes,type="coefficients",s=cvLassoRes$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)          -3.3012462508
## fixed.acidity        -0.1490326911
## volatile.acidity     -2.0088516206
## citric.acid           .           
## residual.sugar        0.0717210982
## chlorides            -0.5729267959
## free.sulfur.dioxide   0.2246102553
## total.sulfur.dioxide -0.0004180048
## pH                    .           
## sulphates             0.1333045962
## alcohol               3.7316504071
#The optimal (in the min-1SE sense) lasso model drops citric.acid and pH, retaining eight of the ten variables

lassoResScaled <- glmnet(scale(x),y,alpha=1)
cvLassoResScaled <- cv.glmnet(scale(x),y,alpha=1)
predict(lassoResScaled,type="coefficients",s=cvLassoResScaled$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                1
## (Intercept)           5.87790935
## fixed.acidity        -0.01983232
## volatile.acidity     -0.15534417
## citric.acid           .         
## residual.sugar        0.06387632
## chlorides            -0.01378313
## free.sulfur.dioxide   0.12817935
## total.sulfur.dioxide -0.03229294
## pH                    .         
## sulphates             0.01746241
## alcohol               0.39811840

Lasso on white wine returns results comparable to the 5-variable regsubsets model, with the difference that “fixed.acidity”, “chlorides” and “sulphates” additionally enter the lasso model, albeit with small coefficients. Test MSE is slightly higher, 0.57 (vs 0.56). The three variables chosen by resampling (“volatile.acidity”, “free.sulfur.dioxide”, “alcohol”) have the largest lasso coefficients.

lassoCoefCnt <- 0
lassoMSE <- NULL
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
  cvLassoTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
  lassoTrain <- glmnet(x[bTrain,],y[bTrain],alpha=1,lambda=10^((-120:0)/20))
  lassoTrainCoef <- predict(lassoTrain,type="coefficients",s=cvLassoTrain$lambda.1se)
  lassoCoefCnt <- lassoCoefCnt + (lassoTrainCoef[-1,1]!=0)
  lassoTestPred <- predict(lassoTrain,newx=x[!bTrain,],s=cvLassoTrain$lambda.1se)
  lassoMSE <- c(lassoMSE,mean((lassoTestPred-y[!bTrain])^2))
}
mean(lassoMSE)
## [1] 0.5762332
quantile(lassoMSE)
##        0%       25%       50%       75%      100% 
## 0.5567583 0.5704082 0.5758174 0.5824260 0.6125151
lassoCoefCnt
##        fixed.acidity     volatile.acidity          citric.acid 
##                   22                   30                    0 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##                   30                   24                   30 
## total.sulfur.dioxide                   pH            sulphates 
##                   10                    6                   14 
##              alcohol 
##                   30

Ridge

ridgeRes <- glmnet(x,y,alpha=0)
plot(ridgeRes) 

cvRidgeRes <- cv.glmnet(x,y,alpha=0)
plot(cvRidgeRes)

cvRidgeRes$lambda.min
## [1] 0.04197234
cvRidgeRes$lambda.1se
## [1] 0.2698013
#With default λ’s the lowest MSE is attained for the least regularized model (for the lowest λ)

#cvRidgeRes <- cv.glmnet(x,y,alpha=0,lambda=10^((-50:60)/20))
#plot(cvRidgeRes)

#cvRidgeRes$lambda.min

#cvRidgeRes$lambda.1se

predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.min)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                1
## (Intercept)          -4.31557286
## fixed.acidity        -0.31502445
## volatile.acidity     -2.18189589
## citric.acid           0.09055180
## residual.sugar        0.15061177
## chlorides            -1.68626410
## free.sulfur.dioxide   0.32862652
## total.sulfur.dioxide -0.00237491
## pH                    0.64148161
## sulphates             0.64097515
## alcohol               3.76173901
predict(ridgeRes,type="coefficients",s=cvRidgeRes$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                 1
## (Intercept)          -1.500024080
## fixed.acidity        -0.345275490
## volatile.acidity     -1.698317641
## citric.acid           0.142207966
## residual.sugar        0.077099496
## chlorides            -2.797200270
## free.sulfur.dioxide   0.242778921
## total.sulfur.dioxide -0.001816626
## pH                    0.620833471
## sulphates             0.481160813
## alcohol               2.780037603
#As expected, for more regularized model (using 1SE rule) coefficients are smaller by absolute value than those at the minimum of MSE

ridgeResScaled <- glmnet(scale(x),y,alpha=0)
cvRidgeResScaled <- cv.glmnet(scale(x),y,alpha=0,lambda=10^((-50:60)/20))
predict(ridgeResScaled,type="coefficients",s=cvRidgeResScaled$lambda.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                1
## (Intercept)           5.87790935
## fixed.acidity        -0.03654364
## volatile.acidity     -0.13674338
## citric.acid           0.01197059
## residual.sugar        0.06739607
## chlorides            -0.05201741
## free.sulfur.dioxide   0.13791681
## total.sulfur.dioxide -0.08281277
## pH                    0.02223374
## sulphates             0.03872091
## alcohol               0.31683952

The top 5 variables in ridge are also selected by regsubsets. The three variables highlighted by the resampling results (“alcohol”, “sulphates”, “volatile.acidity”) are likewise among the top 3 ridge coefficients. MSE is comparable to lasso.

ridgeCoefCnt <- 0
ridgeCoefAve <- 0
ridgeMSE <- NULL
# same resampling scheme as for lasso, additionally averaging the coefficients
for ( iTry in 1:30 ) {
  bTrain <- sample(rep(c(TRUE,FALSE),length.out=nrow(x)))
  cvridgeTrain <- cv.glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
  ridgeTrain <- glmnet(x[bTrain,],y[bTrain],alpha=0,lambda=10^((-50:50)/20))
  ridgeTrainCoef <- predict(ridgeTrain,type="coefficients",s=cvridgeTrain$lambda.1se)
  ridgeCoefCnt <- ridgeCoefCnt + (ridgeTrainCoef[-1,1]!=0)
  ridgeCoefAve <- ridgeCoefAve + ridgeTrainCoef[-1,1]
  ridgeTestPred <- predict(ridgeTrain,newx=x[!bTrain,],s=cvridgeTrain$lambda.1se)
  ridgeMSE <- c(ridgeMSE,mean((ridgeTestPred-y[!bTrain])^2))
}
ridgeCoefAve <- ridgeCoefAve / length(ridgeMSE)

ridgeCoefAve
##        fixed.acidity     volatile.acidity          citric.acid 
##         -0.343210096         -1.672566887          0.145148960 
##       residual.sugar            chlorides  free.sulfur.dioxide 
##          0.077962901         -2.816598544          0.234733434 
## total.sulfur.dioxide                   pH            sulphates 
##         -0.001770157          0.596507682          0.487427508 
##              alcohol 
##          2.751838813
mean(ridgeMSE)
## [1] 0.5780262
quantile(ridgeMSE)
##        0%       25%       50%       75%      100% 
## 0.5468051 0.5664662 0.5769768 0.5901110 0.6071624
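To back up the claim that ridge and lasso test errors are comparable, the two resampled MSE vectors could be compared directly. A minimal self-contained sketch (the simulated vectors below are stand-ins for the lassoMSE and ridgeMSE vectors computed in the loops above; the splits in the two loops are independent, so an unpaired test is appropriate):

```r
# Sketch: comparing two resampled test-MSE vectors.
# mse.lasso / mse.ridge are simulated stand-ins for lassoMSE / ridgeMSE.
set.seed(3)
mse.lasso <- rnorm(30, mean = 0.576, sd = 0.015)
mse.ridge <- rnorm(30, mean = 0.578, sd = 0.015)
summary(mse.lasso - mse.ridge)              # differences centered near zero
t.test(mse.lasso, mse.ridge)$p.value        # two-sample test of the difference
```

On the actual lassoMSE and ridgeMSE vectors the same test would indicate whether the small difference in mean MSE (0.576 vs 0.578) is distinguishable from resampling noise.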

Sub-problem 5: PCA (10 points)

Merge data for red and white wine (function rbind allows merging of two matrices/data frames with the same number of columns) and plot data projection to the first two principal components (e.g. biplot or similar plots). Does this representation suggest presence of clustering structure in the data? Does wine type (i.e. red or white) or quality appear to be associated with different regions occupied by observations in the plot? Please remember not to include quality attribute or wine type (red or white) indicator in your merged data, otherwise, apparent association of quality or wine type with PCA layout will be influenced by presence of those indicators in your data.

#For better visualization: darker and bigger points indicate higher wine quality; red hue marks red wine.
red.wine.tmp = red.wine
white.wine.tmp = white.wine
red.wine.tmp$color = 0
white.wine.tmp$color = 50/360

wine = rbind(red.wine.tmp,white.wine.tmp)
wine = wine[order(wine$quality),]
wine.inp = wine[,c(-12,-13)]

pc.wine = prcomp(wine.inp,scale=TRUE)
pc.wine.pc12 = pc.wine$x[,1:2]
plot(pc.wine.pc12,pch=16,cex=wine$quality^4/5400,xlim=c(-5,6.5),ylim=c(-6,7.5),col=hsv(wine$color, 1, 1-wine$quality/12))
arrows(0,0, pc.wine$rotation[,"PC1"]*12,pc.wine$rotation[,"PC2"]*12,length=0.1, lwd=1,angle=20, col="red")
text(pc.wine$rotation[,"PC1"]*13,pc.wine$rotation[,"PC2"]*13,rownames(pc.wine$rotation), col="red", cex=.4)

#Bigger circles mean higher quality
#red points are red wine

This biplot shows a clear clustering structure in the data by wine type. Red wines tend to have higher than average “volatile.acidity”, “sulphates”, “chlorides” and “fixed.acidity” than white wines, whereas white wines tend to have higher than average “total.sulfur.dioxide”/“free.sulfur.dioxide” and “residual.sugar”. Generally, wines of higher quality have average or above-average alcohol.

“free.sulfur.dioxide” and “total.sulfur.dioxide” are highly correlated.
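This collinearity can be quantified with cor(). A self-contained sketch (the simulated columns below stand in for the actual free.sulfur.dioxide and total.sulfur.dioxide columns of wine.inp; on the merged data one would call cor() on those columns directly):

```r
# Sketch: quantifying collinearity between two predictors with cor().
# free.so2 / total.so2 are simulated stand-ins for the wine data columns.
set.seed(1)
free.so2  <- rexp(100, rate = 1/35)              # free SO2 levels (mean ~35)
total.so2 <- 3*free.so2 + rnorm(100, sd = 10)    # total SO2 tracks free SO2
cor(free.so2, total.so2)                         # close to 1: highly correlated
```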

Extra: model wine quality using principal components

Compute PCA representation of the data for one of the wine types (red or white) excluding wine quality attribute (of course!). Use resulting principal components (slot x in the output of prcomp) as new predictors to fit a linear model of wine quality as a function of these predictors. Compare resulting fit (in terms of MSE, r-squared, etc.) to those obtained above. Comment on the differences and similarities between these fits.

pc.red.wine.log = prcomp(red.wine.log[,-12],scale=TRUE)
pc.red.wine.log.pc12 = pc.red.wine.log$x[,1:2]
plot(pc.red.wine.log.pc12,pch=16,cex=red.wine.tmp$quality^4/2500,xlim=c(-5.5,6.5),ylim=c(-6,7.5),col=hsv(red.wine.tmp$color, 1, 1-red.wine.tmp$quality/12))
arrows(0,0, pc.red.wine.log$rotation[,"PC1"]*12,pc.red.wine.log$rotation[,"PC2"]*12,length=0.1, lwd=1,angle=20, col="red")
text(pc.red.wine.log$rotation[,"PC1"]*13,pc.red.wine.log$rotation[,"PC2"]*13,rownames(pc.red.wine.log$rotation), col="red", cex=.4)

#pc1.wine = pc.red.wine.log$rotation[,"PC1"]; pc1.wine[which.max(abs(pc1.wine))]

#pc2.wine = pc.red.wine.log$rotation[,"PC2"]; pc2.wine[which.max(abs(pc2.wine))]

pc.red.wine.log.var = (pc.red.wine.log$sdev^2)[1:11]
pve.red.wine.log.var = pc.red.wine.log.var/sum(pc.red.wine.log.var)

old.par <- par(mfrow=c(1,2),ps=10)
plot(pve.red.wine.log.var, main="Standardized", xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), type="b")
plot(cumsum(pve.red.wine.log.var), main="Standardized", xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1), type="b")

par(old.par)

pcr.red.wine.log=pcr(quality~.,data=red.wine.log,validation="CV",scale=TRUE)
summary(pcr.red.wine.log)
## Data:    X dimension: 1599 11 
##  Y dimension: 1599 1
## Fit method: svdpc
## Number of components considered: 11
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV          0.8078   0.8032   0.7469   0.6676   0.6658   0.6618   0.6622
## adjCV       0.8078   0.8031   0.7466   0.6675   0.6657   0.6617   0.6621
##        7 comps  8 comps  9 comps  10 comps  11 comps
## CV      0.6568   0.6555   0.6489    0.6493    0.6496
## adjCV   0.6566   0.6553   0.6487    0.6490    0.6493
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         28.593    46.93    61.23    71.98    81.14    87.12    92.20
## quality    1.327    14.98    32.00    32.42    33.28    33.33    34.52
##          8 comps  9 comps  10 comps  11 comps
## X          95.77    98.05     99.51     100.0
## quality    34.93    36.26     36.30      36.3
validationplot(pcr.red.wine.log,val.type = "MSEP")

x <- model.matrix(quality~.,red.wine.log)[,-1]
y <- red.wine.log[,"quality"]

set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
y.test=y[test]
    
set.seed(1)
pcr.red.wine.log=pcr(quality~.,data=red.wine.log,subset=train,validation="CV",scale=TRUE)

old.par <- par(mfrow=c(1,2),ps=10)
validationplot(pcr.red.wine.log,val.type="MSEP")
validationplot(pcr.red.wine.log,val.type="R2")

par(old.par)

pcr.red.wine.log.pred=predict(pcr.red.wine.log,x[test,],ncomp=6)
mean((pcr.red.wine.log.pred-y.test)^2)
## [1] 0.4045485
pcr.red.wine.log=pcr(y~x,scale=TRUE,ncomp=6)
summary(pcr.red.wine.log)
## Data:    X dimension: 1599 11 
##  Y dimension: 1599 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X   28.593    46.93    61.23    71.98    81.14    87.12
## y    1.327    14.98    32.00    32.42    33.28    33.33

Six components explain 87% of the variance in the predictors. The MSE and R² obtained with the principal-component predictors are somewhat lower, but still comparable to the fits above.
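The R² figures referred to above can be computed from held-out predictions as R² = 1 − RSS/TSS. A self-contained sketch (the simulated values below stand in for y.test and the PCR predictions pcr.red.wine.log.pred):

```r
# Sketch: test MSE and R^2 from held-out predictions, R2 = 1 - RSS/TSS.
# y.obs / y.pred are simulated stand-ins for y.test and the PCR predictions.
set.seed(2)
y.obs  <- rnorm(200, mean = 5.6, sd = 0.8)    # observed quality scores
y.pred <- y.obs + rnorm(200, sd = 0.6)        # imperfect model predictions
mse <- mean((y.pred - y.obs)^2)
r2  <- 1 - sum((y.obs - y.pred)^2)/sum((y.obs - mean(y.obs))^2)
c(MSE = mse, R2 = r2)
```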